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Abstract 

We consider penalized estimation in hidden Markov models (HMMs) with multi- 
variate Normal observations. In the moderate-to-large dimensional setting, estimation 
for HMMs remains challenging in practice, due to several concerns arising from the 
hidden nature of the states. We address these concerns by ^-penalization of state- 
specific inverse covariance matrices. Penalized estimation leads to sparse inverse co- 
variance matrices which can be interpreted as state-specific conditional independence 
graphs. Penalization is non-trivial in this latent variable setting; we propose a penalty 
that automatically adapts to number of states K and state-specific sample size and 
can cope with scaling issues arising from the unknown states. The methodology is 
adaptive and very general, applying in particular to both low- and high-dimensional 
settings without requiring hand tuning. Furthermore, our approach facilitates ex- 
ploration of the number of states K by coupling estimation for successive candidate 
values K. Empirical results on simulated examples demonstrate the effectiveness of 
the proposed approach. In a challenging real data example from genome biology, we 
demonstrate the ability of our approach to yields gains in predictive power and deliver 
richer estimates than existing methods. 

Keywords HMM, Graphical Lasso, Universal Regularization, Model Selection, 
MMDL, Greedy Backwards Pruning, Genome Biology, Chromatin Modeling 
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1 Introduction 



In this paper we consider estimation in high-dimensional hidden Markov models. We 
consider multivariate observations Xt £ W with discrete index t € T = {1 . . . n} and 
hidden states St G {1 • • • K}. Conditional on state, emission distributions are multivariate 
Normal (MVN), with X t \ S t = k ~ 7V(/i fc , S fe ) (where, S) denotes the MVN density 
with mean and covariance matrix £). Estimation in the small p case of univariate or 
low-dimensional observations is a well-studied problem. In contrast, estimation in the 
larger p setting remains challenging due to several factors: 

• High- dimensionality. Inference in HMMs with moderate or large number of features 
is, in a sense, always a high-dimensional problem since the ratio minnk/p may be 

k 

small, as it depends on the unknown number of states and the unknown size of the 
states (rifc denotes the number of samples in state k). Therefore, large samples for 
each state cannot be relied upon at the outset, even when the overall sample size 
n = Yjk n k is large. 

• Covariance structure. Estimation is especially challenging in settings where covari- 
ances cannot be assumed to have a simple structure (e.g. diagonal) or where 
state-specific covariance structure is itself of scientific interest. Then, due to Simp- 
son's paradox, state-specific covariances must be jointly estimated along with state 
assignments. 

• Number of hidden states. The model selection problem of determining or explor- 
ing the number of states K is coupled to the estimation problem for known K. In 
the multivariate setting, estimation for known K is itself challenging. Then, the 
straightforward strategy of fitting models for various values K and comparing by 
model selection criteria may become difficult or intractable, especially when practi- 
cally important issues like initialization and setting of tuning parameters are taken 
into consideration. 

We propose a penalized log-likelihood procedure involving £i-norms of the state-specific 
inverse covariance matrices S^" 1 , with optimization carried out within an expectation- 
maximization (EM) framework. Our approach has several attractive features: 

• Penalized estimation leads to sparse inverse covariance matrices whi ch can be inter 



preted as state-specifi c cond itional independence graphs or networks (lYuan and Lin , 
20071 : iFriedman etaR 120081 ) . 



The specific penalty we propose automatically adapts to the number of states and 
state-specific sample size and enjoys scale invariance that takes care of state-specific 
scaling. 

The number of states K can be selected automatically, or estimates for various values 
K explored, using a computationally efficient approach that couples estimation for 
successive candidate values for K. 

The approach requires essentially no hand tuning; only a maximum number of states 
K max must be set by the user. Otherwise, tuning parameters (including, if desired, 
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K itself) are set automatically. 



Our approach is very general: as we demonstrate below it works well in diverse regimes, 
including both low- and high-dimensional examples, with no hand-tuning required. In a 
real data example from genomics the methodology leads to large gains in predictive power 
relative to existing approaches. 

Penalized estimators can be incorporated into EM-type algorithms an d a number of re- 
cent authors have done so , notably in the conte xt of mixture models (jKhalili and Chenl . 
20071 : IStadler et all [201CJ; |Pan and Shenl . 120071 ). However, the unknown nature of the 
states (or mixture components) poses special challenges for penalization that have not 
been adequately addressed so far. In particular, appropriate penalization must account 
for the number of hidden states and their respective sample sizes, but these are themselves 
unknown at the outset. Fur thermore, scaling als o poses a subtle proble m: in the classi- 
cal Lasso ( Tibshirani . 19961 ) or Graphical llsso Friedman et a/J .B standardization 
is an important pre-processing step to ensure appropriate scaling. However, in HMMs 
and mixtures different states or components may differ with respect to scale, but since 
state assignments are a priori unknown, standardization cannot be carried out as a pre- 
processing step. The penalty we propose automatically a dapts with state-sizes and take s 
care of scaling issues. Inspired by the seminal paper of iDonoho and Johnstond (119 94) 



and r elated work in the Lasso context (|Zhanel . l2010i : ISun and Zhang 



2011 



Barron et al. 



20081 ). our penalty allows for universal regularization by use of a tuning parameter A un] 
that depends only on n and p. Using universal regularization by A un j within our EM 
algorithm allows automatic adaptation to number of states K and state-specific sample 
sizes. As a consequence of these features, our procedure for penalized estimation for a 
given number of states K is entirely free of user-set parameters. 

Parameter estimates for successive values K, K + 1 are related, and it is therefore natural 
to exploit this fact in exploring the number of states; we do so using an iterative algorithm. 
In principle, an iterative approach could proceed in a "top down" manner from few states 
to many, or "bottom up" from many states to few. However, we cannot in general gain 
information about two underlying states from estimates obtained from a single, merged 
state (Simpson's paradox); this means the "top down" approach cannot be reliably used 
in the multivariate setting. We therefore proceed in a "bottom up" manner, starting with 
a large number of states K max and iteratively reducing the number of states through the 
entire considered range. Model order reduction is guided using the Kullback-Leibler diver- 
gence between state densities; this naturally takes account of both mean and covariance 
information. This exploration is efficient because (i) current estimates are used to provide 
initialization for the subsequent iteration and (ii) we initialize the EM algorithm only once, 
at the first iteration corresponding to K = K max . As we demonstrate below, this proce- 
dure in fact outperforms the "brute-force" approach of entirely separately fitting models 
for various K's. In this way, our approach allows tractable exploration of estimates for 
a range of valu es K and, if desired, autom atic selection of K. Our approach is inspired 
by the work of iFigueiredo and Jain (120001 ) who used a similar strategy in the context of 
low-dimensional mixtures. 

Applications for high-dimensional HMMs are numerous, in fields ranging from engineering 
to biology. The original motivation for our work comes from genome biology and we 
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illustrate our methods on a real data example from that field. HMMs are very widely 
used in genomics. Measurements at genome locations t constitute the vector Xt while 
states St are identified with biological states of the genome (e.g. whether the location 
t is within a gene-coding region) . Early, pioneering applications of HMMs to genomic 
data (see e.g. iDurbin et all Il998l ) considered univariate or low-dimensional observations 
Xt (such as the gene sequence itself). However, in recent years technological advances 
have begun to permit higher dimensional st udies. For ex ample , using technologies such 
as DamID ( van Steensel and Henikofj l200d ) or ChlP-seq ((Park], |2009j) it is now possible 
to measure the binding of proteins to the DNA across the entire genome for dozens or 
hundreds of proteins and the dimensionality (i.e. number of proteins) of such approaches 
continues to increase. However, absent reliable methodology for fitting high-dimensional 
HMMs, it is common practice in the field to instead consider reduced dimension versions of 
the data (by selecting key "m arker" variables or carrying out dimensionality reduction as 
a pre-processing step, see e.g. iFilion et al\ (1201010 or by discret izing the data and treating 
observations as independent Bernoulli ( Ernst and Kellis . 2O10l ). We show below in a real 
data example from genome biology that our penalized approach applied to all available 
variables (proteins) from a recent experiment yields large gains in predictive accuracy (on 
held-out test data) relative to a reduced-dimension approach, as well as relative to classical 
estimation applied to the full set of variables. 



2 Inference in hidden Markov models with state-specific 
graphical models 

We consider a hidden Markov model (HMM) with multivariate Normal (MVN) emissions. 
We denote by St £ {1, . . . , K} the (hidden) state process, i.e., a discrete Markov chain with 
transition matrix Hkk' = P(<St+i = k'\St = k); in order to simplify the notation we omit 
the initial probabilities pk = P(<Si = k) in the further description of our methodology. 
We denote by X t S W the observed process with emission distribution Xt | St = k ~ 

The case of sparse inverse covariance matrices O^. = S^ 1 will be of particular interest. 
For each state we have a Gaussian graphical model with undirected graph Gk defined by 
locations of zero entries in the inverse covariance matrix, i.e. (I, I') ^ Gk <^=^ (flk)w = 0. 
We denote model parameters by ®k = (0\, . . . , 0R-,n), 0^ = (fJ-k, &k)- The goal, for given 
K, is to infer @k from the observed nx p data matrix X, and further to solve the related 
problem of exploring (or determining) K itself. 

As noted in the Introduction, inference for multivariate HMMs is challenging due to dif- 
ficulties arising from the unknown states. To motivate the methods described in this 
Section, we gather these points together, highlighting four main difficulties/concerns: 

(i) High- dimensionality. Inference in HMMs with moderate or large number of features 
is, in a sense, always high-dimensional, as the ratio min ^ may be small, and depends 

on the number of states and their size, both of which are usually unknown at the 
outset. 
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(ii) State-specific covariance structure. When state-specific covariance structure is of 
interest or cannot be assumed to be diagonal, estimation is challenging. Importantly, 
due to Simpson's paradox, covariances must be jointly estimated along with other 
parameters. 

(iii) Regular ization. The size and scale of individual states may vary and are usually 
unknown at the outset. Regularization schemes need to self- adapt appropriately. 

(iv) Model order exploration. In the HMM setting, model selection criteria are a function 
of both number of states K and amount of regularization A. However, brute force 
search over (K, A) may become intractable in practice, especially since at each grid 
point, multiple initializations are needed to guard against local optima. 

Points (i)-(iv) above are coupled and are difficult or impossible to address individually. 
The methodology we propose aims to address all these concerns simultaneously. We first 
outline the approach at a high-level and then describe the algorithms in detail. 

Conceptually, it makes sense to think of inference in a HMM (or mixture model) as a com- 
bination of two (coupled) tasks. The first task consists of estimating the model parameter 
@K, given the number of states K and a regularization parameter A. For this task, we 
propose to minimize the negative penalized log-likelihood 

@ K ,x = argmin-^(e^ A ;X) + Apen(e^ A ), (2.1) 
where £(@k,\\ X) denotes the observed log-likelihood an d pen(6^ t \) is a penalty function 



involving the i\ -norms of the inverse covar iance matrices ([Yuan and Linl . 120071 ; iFriedman et al. 



20081 : Meinshausen and Biihlmann . 20061 ) that we describe in detail below. The l\ 



•norm 

is especially appealing when the goal is network inference, as it induces sparsity in fi^'s 
and therefore in the corresponding undirected graphs Gk- We solve this problem by an 
EM-type algorithm, using a specific penalty that we describe below; we call this approach 
HMMGLasso (see Section O for details). As with every EM algorithm, HMMGLasso 
only converges to a local optimum and depends on initialization. 

The second task involves determining an appropriate number of states K*, and suitable 
penalization parameter A*. This is a model selection problem, and can in principle be 
solved by minimizing a model selection criterion C(K, A) (we consider specific criteria 
below), i.e., 

(K*, A*) = argmin C(K, A). (2.2) 

K,\ 



Thus, estimation in multivariate HMMs involves two coupled tasks whose solution must 
address the concerns (i)-(iv) listed above. In outline, we proceed as follows. We propose 
a universal regularization level A un ; that can be calculated analytically and removes the 
need for brute force search over A. The idea is, that solving (|2.ip with the HMMGlasso 
penalty function that we describe in Section \2. II and with A = A un ; (see Section T2.2p should 
provide for each fixed K a close-to-optimal solution for that specific K. In this way, use 
of universal regularization A un j reduces solving (j2.2[) to a search over K only. 
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We also exploit the relationship between estimates Qk for successive -RT's to allow efficient 
model exploration and, if desired, determination of K. Specifically, we start with a (too) 
large number of states K max and then successively reduce the model size by merge/delete 
operations described below. In this way we move towards a minimum number of states 
-Kmirn obtaining estimates for all considered values K m \ n < K < K max , but initializing 
only once, at K = K max . If desired, a specific K can then be chosen to minimize model 
selection criterion C(-,A un j). We call this approach Greedy Backward Pruning. As 
we show below, following the Greedy Backward Pruning strategy gives highly competitive 
estimates, despite initializing only once in the entire procedure. 

In summary, the adaptive regularization strategy we propose in HMMGLasso permits esti- 
mation of HMMs with state-specific covariance structure in both low- and high-dimensional 
settings, whilst taking care of state size and scaling; this addresses points (i)-(iii) above. 
Greedy Backward Pruning, with initialization only at K = K mSuX , takes care of point (iv). 

The remainder of this Section provides a detailed description of our algorithms. We discuss 
in turn HMMGLasso (Section l2.ip . universal regularization (Section 12. 2\i and model order 
exploration by Greedy Backward Pruning (Section I2.3H . 



2.1 HMMGLasso in detail: Baum- Welch algorithm and l\ regularization 

Maximum likelihood estimation for HMM is usually performed using the EM algorithm 
(or Baum- Welch algorithm in the HMM context). Let 4(6; X, S) 

4(e;X,S) = ^% fe ,O fe ;T 1 ,T 2 )+^n;T 3 ), (2.3) 

k 

where S = (S\, . . . , S n ) are state assignments, X = (X\, . . . , X n ) T is the nxp data matrix, 
£(Hk, Qk', Ti, T2) is the log-likelihood of the MVN distribution with mean and inverse 
covariance $1^ and T3) is the log-likelihood of the Markov Chain with transition matrix 
n. Ti = X T 1,T2 = XX T and (T^)kk' = Ylt^-i^t = k,St+i = k') are the corresponding 
sufficient statistics. 

Following initialization, EM produces a sequence of estimates {0W;i = 1,2,3,...} by 
alternating between E- and M-steps. To facilitate network inference, we seek to induce 
sparsity in the fi^'s. We do this by ^-regularization. In particular, we replace maximiza- 
tion with respect to (/i^, Q,^] in the M-Step of the Baum- Welch algorithm by 

(/4 i+1) ,^ +1) ) = argmin-^^TiST^) + A^^en^). (2.4) 

Here, 

Tf = ^(t)X tt Tf = ^4\t)X t xT 
t t 

denote the expected sufficient statistics given X and current estimate 6^ with state- 
responsibilities ujj, (i) = Pe(i)(St = &|X) obtained from the E-Step. 
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By TTjfc = h St u fc^(*) we denote the (scaled) effective sample size of state k. The penalty 

(i) 

term depends on a regularization parameter A, on the effective sample size 7r k and on a 
function Pen(-) involving £i-norm of Slfc. The reason why we incorporate the square root 
of the effective sample size is that it is known from the Lasso literature that the £i-penalty 
term asy mptotically has to grow with squar e root of the sample size in order to achieve op- 
timality (jBiihlmann and van de Geerllioilh . We consider three slightly different functions 



Pcn(-) defined as follows: 

• Peni nvcov (f2) = the classical penalty known from the Graphical Lasso. It 
imposes ^-constraints on the non-diagonal entries of the concentration matrix Q. 

• Pen parcor (il) = H^ - ||i, where ^ is the partial correlation matrix which can be written 

as = Qui I '. 

• Peni nvcor (f2) = ||$ — where $ is the inverse of the correlation matrix given by 

Note that all three functions penalize the ^i-norm of the concentration matrix and therefore 
lead to sparse fTs. The advantage of Pen parcor (-) and Peni nvcor (-) is that they are scale- 
invariant and therefore remove concerns that arise from state-specific scaling. As we noted 
above, state-specific scaling cannot be removed by pre-processing in the HMM setting since 
state assignments are themselves unknown at the outset. 

Optimization of (|2.4j) is non-standard. It is easy to verify that f)2.4[) reduces to: 

= Ti*°/n, = argmin-log ||O fe || + tr(O fe C u ") + 2—^^Pea(Q k ) (2.5) 

fj, k ,n k n k 

where C u fc ) = — fJ-k +1 \^k +1 " > T ■ For the penalty function Peni nvcov (^optimization 



proble m (|2.5|) can be solved by the Graphical Lasso algorithm presented in lFriedman et al. 



( 20081 ) . In the Appendix we compare these three different penalties and discuss how we 



perform optimization. 

Algorithm 1 summarizes HMMGLasso. As stated above the EM algorithm depends on 
initial specification of parameters, i.e., 9^\ll^ (k = 1, . . . ,K). For convenience (see later 

in text) we directly specify uj^(i) (instead of 6^) and start with an M-Step followed by 
an E-Step. We stop the algorithm if the relative change in the S^'s falls below a threshold 
e or if for at least one state the scaled effective sample size nk is smaller than 7r m i n . 



2.2 Universal regularization 



In this Section we discuss the choice of the regularization parameter A in HMMGLasso. 
We will argue that A un j = \J2n log p/2 is a reasonable regulariz ation parameter fo r HMM- 
GLasso. We do this by cons idering connections wi th the Lasso ( Tibshirani . 19961 ) and the 
Graphical Lasso (or GLasso: iFriedman et aZ.L I20Q8I ) . In the classical Lasso or GLasso setup 
the regularization parameter is usually chosen empirically to minimize the prediction er- 
ror (for example by performing cross-validation). However, in the much more complicated 
HMM (or more generally latent variable) setting, with unknown number of states K, such 
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Algorithm 1 HMMGLasso 



Input K, A, T(°) = {(4 0) (t)) fe=1 . .^ er ,n( ),7r( )} and set i = 0,err(°) = 0. 
1: while {errW < e}V{7r^ > 7r min for all k = 1 ... K} do 
2: M-Step Obtain estimates 



= argmin-% fc ,O fc ;T" k ,T" k ) + Aff'Pen^) 

niS 1} = vW /*« (H« = ngg in 1st iteration) 
3: E-Step Use Forward-Backward equations to update 

u ( £ +1) (t) = P e(l+1) (S t = k\x) 
4 m) = E t 4 m) W/- 

4: Set err( i+1 ) = max <^ ^ and i <- t + 1 

5: end while 

Output H^> A ) = {@K,x,(Ak(t))k=i..K,teT^} 



a brute force strategy is computationally burdensome, motivating the need for universal 
regularization. 

First, consider a classical regression setup with y = X/3+e, where e ~ A/"(0, a 2 T). Here, X is 
a iVxp predictor matrix, y a iVx 1 response vector, /3 denotes the px 1 regression parameter 
and o is the error variance. Then, the Lasso estimator minimizes \\y — X/3|| 2 /2A^ + .s| 



Assuming an orthonormal predictor matrix, iDonoho and Johnstond ()1994l ) showed that 



the risk of the Lasso estimator comes close to the oracle risk if we use s un ; = ay/2 logp/N 
as a regularization parameter. Universal regul arization and the penalty (7^/2 log p/iV is 
discussed also in the non-orthonormal case in IZhanej (1201(1 ) or ISun and Zhan j (|20llh 



(see also Barron et al. ( 20081 ); they propose a universal penalty parameter based on the 



minimum description length principle). It is important to note that s un i decreases with 
1 / y/~N. This is the reason why we include the square-root of the effective sample size into 
the state-specific penalty terms in the HMMGLasso (see Section [2.ip . 

Next, consider the Graphical Lasso, 

f2 = argmin — log |0| — tr(Sfi) + yo 1 1 ST2 1 1 , 

o 

where S is the s a mple covariance matrix of X = (iW,...,!^) ~ Af(0,£) with O = XT 1 . 
Friedman et al. ( 20081 ) showed that the last row/column of Cl can be obtained by solving 



$ = argmin0.5/3S 11 /3 + /3si 2 + p||/3|| 1 , (2.6) 

P 

where /3 and £1 are linked through 012 = £n/3/2 (En is the covariance matrix with 
the last row and column deleted; o\i and S12 denote the last row of the covariance and 
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sample covariance matrix). Note that (|2.6p can be interpreted as the Lasso estimator 
corresponding to regression of variable against X^\ X^~ l \ As n pp is the error 
variance m regressing against we can identify Qpp^ 2 y/2logp/N as 

a good choice for p in ()2.6[) . If SI is standardized to have unit diagonal entries, then we 
can write p un \ = \J2 logp/N. 

Now consider equation (|2.5|) of the HMMGLasso with Peni nvcov (-) and assume all fifc's 

are standardized to have unit diagonal. Equating 2^-\Jtt^ with p un \ = y^2 log p/n^ (the 
universal shrinkage level in the Graphical Lasso with sample size N = n^) and solving for 
A we obtain 

A uni = sjln logp/2. 

For the penalty function Peni nvcov (-) the foregoing indicates that A un i = \J2n log p/2 only 
holds if the Sl^'s are standardized and therefore equal the corresponding partial correla- 
tion matrix. In general, since state assignments are themselves unknown, this standard- 
ization cannot be done as a pre-processing step. However, if we use Pen parcor (-) instead, 
A un i = \J2n logp/2 applies regardless of scaling. Penalizing the partial correlatio n can be 
seen as a generalization of the "scaled" Lasso proposed by Stadler et al. ( 20ld ). There, 



the negative log-likelihood is penalized by s-^- and optimization is performed over /3 and 
a simultaneously. A reasonable choice for s is ■\/2\ogp/~N \ w hich does not depend any- 



more on the unkno wn noise level (see ISun and Zhangl (|201ll ) and also the discussion in 
Stadler et al] (|2010h l. 



Thus, A un i is the penalty level we use for estimation in HMMGLasso. It is "universal" in the 
sense that it only depends on the dimensionality of the input data n and p. Furthermore, 
when Auni is used with the penalty Pen parcor (-) the penalization self-adapts to the hidden 
states by incorporating the square-root of the effective sample size and by taking care of 
scaling. 

2.3 Model order exploration using Greedy Backward Pruning 

Greedy Backward Pruning can in principle be used with a wide range of model selection 
criteria; here we consider the popular Bayesian Information Criterion (B IC) and the Mix 



ture M inimum Description Length (MMDL). MMDL was introduced by iFigueiredo et al. 



(|l999h and was specifically proposed for the purpose of determining the number of compo 



nents in finite mixtures. We first describe these criteria and then go on to give a detailed 
description of the Greedy Backward Pruning algorithm. 

Model selection criteria. A model selection criterion C has to trade off goodness-of-fit 
and model complexity. BIC and MMDL are defined by 

BIC(e^ jA ) = -£(&k,x; X) + i \og(n)K{K + \ log(n) £ Df (fc, A) 

k 

MMDL(6 k ,a) = -£{Qk s x; X) + i log(n)K(K - 1) + £ i log(mr fc )Df (*, A), 

k 



9 



where in the context of t\ penalized log-likelihood we set the degrees of freedom as 

w(M)=p+EW 



v>i 



MMDL can be motivated by the minimum description length principle (jGrunwaldl . 120071 ) . 
The negative log-likelihood represents the optimal code-length of the data given model 
parameters 0. The term ^ log (n)K (K — 1) is the "optimal" code- length for the transition 
matrix II (note that II is estimated from all data) . As wK k is the effective sample size from 
which 9 k = (nk^k) is estimated we get \ log(n7T£.)Df (k, A) as an "optimal" code-length 
for describing 9 k . 

The main difference between BIC and MMDL is the use o f the effective sample siz e n%k in 
the code-lengths for parameters which are state-specific. iFigueiredo et al. argued 
using ideas from minimum description length literature that MMDL is more appropriate 
for mixtures than BIC. They demonstrate on real and synthetic data that MMDL out- 
performs BIC. In Section we compare performance of Greedy Backward Pruning using 
BIC and MMDL as mode l selection criteria . In o ur more involved inference task we come 
to the same conclusion as IFigueiredo et al. (|l999l ). namely that MMDL outperforms BIC. 



Greedy Backward Pruning in detail. Greedy Backward Pruning works by first es- 
timating parameters using HMMGLasso with a large number of states K me ^ and then 
iteratively reducing the number of states until some minimal number of states K m [ n is 
reached. Each iteration involves either merging the two "closest" states or deleting the 
"smallest" state, and then re-running HMMGLasso with one fewer state, using estimates 
from the previous step as initialization. This scheme is summarized in Algorithm [2j 

We give now a definition of "smallest" state and "closest" states and describe the "merge" 
and "delete" operation in detail. Let Ok be the current estimate for K states. The merge 
operation consists of detecting the two closest states k\ and hi defined as 

(h,k 2 )= argmin V s (6 k \\6 k ,), 
k,k'£{l,...,K} 

where T) s (6k\\0k') is the symmetric Kullback-Leibler divergence given by 

v s (e k \\e k ,) = tr{(s fc - E fc 0(E* - ^k 1 )} + (m* - Vk>) T (z k 1 - - m*0- 

We merge states k\ and k<i into a new state (denoted by k\ U k<i) by forming new initial 
conditions for the next run of HMMGLasso with K — 1 states. In particular we compute 
merged responsibilities as 

Umer feiUfc 2 (*) = %1 (*) + U ^2 (*) 

Umer k(t) = Ujfc(t) (for fc^fclUfe) 

and get a merged transition matrix using updates 

n mer k!Uk 2 ,k' = A fclfc / + fl k2k > (for k' ^ h U fc 2 ) 

n mer k,k' = n fejjfc / (for k', fc^fciU k 2 ) 
n mer fc',fciufc 2 = 1 /( K ~ 1 ) (fork' = 1,...,K -1). 
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All these operations are based on the relation: ~P(St : 
k 2 \-). 



:k 1 US t = k 2 \-)=P(S t = k 1 \-)+P(S t 



The delete operation simply discards the smallest state according to mim-gn ...,#-} ^k- I m_ 
tial conditions Ud e i 5 n^ei arising from deleting a state are derived by omitting correspond- 
ing row/ column of u, II and renormalizing these quantities such that rows sum up to one. 

Note that Greedy Backward Pruning algorithm needs to be initialized only once, namely 
at K ma _ x . Further, we note from Algorithm [2] that we decide between the "merging" 
and "deleting" operations based on the model selection criterion, i.e., if initial conditions 
obtained from merging leads to estimate with smaller criterion C we choose that solution 
otherwise we take the solution obtained from the "delete" operation. As demonstrated 
in examples below, Greedy Backward Pruning with only a single initialization at large 
Km&x yields remarkably good est imates in the unknown K case. Our procedur e origi nates 
from the algorithm s proposed in Figueiredo et al. ( 19991 ). Figueiredo and Jain ( 200d ) and 
Biceeo et all l|200al ). Our empirical results below echo the findings of these authors that 
Greedy Backward Pruning-like approaches can confer robustness to initialization. 



Algorithm 2 Greedy Backward Pruning with HMMGLasso 
Input K min and K max . 

Initialization of T^ m -) = {(u fc (t)) fc= i.. A - maxjter , n, vr}. 

1: Fit HMMGLasso and obtain: H^ max > A ™) <- HMMGLasso (£T max , A uni , T^ max )). 

2: Set k = K max . 

3: while k > K m [ n do 
4: Merge Or Delete 

Compute merged/deleted initial conditions: T mer and Tdei- 
Compute H mer <(— HMMGLasso(K — 1, A un j, T mer ) 
Compute Hdei HMMGLasso(K — 1, A un j, Tdei)- 

5: Update: 

Set K i — K — 1. 

Set H( K < A -i) 4r- ~ mer if C(6 mer ) < C(9 d el). 

Set H( K ' A -i) Hdei if C(6 d ei) < C(6 mer ). 
6: end while 

7: Set: K opt = argminC(0 K)Aun .). 

K 

Output final estimates: ®K op t,\ UJli - 



3 Examples 

3.1 Simulation studies 

In this Section we describe data-generating models that we use for simulation examples. 
We consider the following data-generating models: 
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Model 1 K truc G {2,4,6},n = 2000,j> = 10, (n/p-ratio=200). 

Transition matrix. Tlkk' = O.I7 and = O.97, where k, k' G {1, . . . , K true } and 7 
is chosen such that ^a/=i c ^kk' = 1- 

Means fi^, k = 1, . . . , -RTtrue- Each state has p/Kt Iue nonzero entries with value 
(— \) k a>/ yjp/ K truc . Nonzero's are at different locations for each state. 

Concentration matrix Oj., k = 1, . . . , i^true- Each state has p nonzero (off-diagonal) 
entries. To reflect the setting in which states share some aspects of graphical model 
structure, p/2 non-zeros are shared between all states, whereas the other p/2 non- 
zeros are_jt_dj^erent l ocatio ns for each state. Concentration matrices are generated 



as m 



Rothman et al\ (|2008l ) but standardized to have unit diagonal entries. 



Model 2 As model 1 but with p = 75, (n/p-ratio=26 2/3) 

Model 3 As model 1 but with n = 1000 and p = 100 , (n/p-ratio=10) 

Model 4 K truc G {2,4,6},n = 5000,p = 50. 

Transition matrix. n fcA ./ = O.I7, U kk = O.97 for k / K tTUC ; TlK tIUC ,k> = V^tmc 
(A/ G {1, . . . , i^truc})- Again, 7 is chosen such that rows sum up to one. 

Means. (^k)l = a f° r I £ {!> 2} and k G {1, 2}. All other entries equal zero. 

Concentration matrix. For k = 1,2: = I p . For k = 3, . . . , i^truc : has two 
nonzero entries, at different locations for each state. Concentration matrices are 
standardized to have unit diagonal entries. 

Ideally we seek methodology that can automatically adapt to both low- and high-dimensional 
settings. Accordingly, Models 1, 2 and 3 have the same design but differ with respect to 
n/p-ratio. We include the small p, large n Model 1 as a baseline and to investigate the 
performance of universal regularization in the classical low-dimensional setting. Model 4 
is a challenging problem, similar in terms of n,p to the real, genomic data example below. 



Experiment I: Number of States. In this Experiment the focus is on state recovery. 
We explore the ability to estimate the correct number of states K and recover the state 
assignments. We compare the following methods 

• HMMGLasso initialized by Kmeans (Hmmgl) 

• HMMGLasso with Greedy Backward Pruning (Bwprun) 

• Unpenalized maximum likelihood estimation (MLE) (Unpen) 

• MLE with diagonal restricted covariance matrices (Diagcov) 



Mod el-based clustering via Gaussian mixture models (Mclust; iFraley and Raftery . 
20061 ) 



Thus, Hmmgl and Bwprun are the methods we propose. Both Hmmgl and Bwprun carry 
out estimation (for given K) using the penalty and universal regularization Vlcl Auni ttlclt 
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we put forward above; the former embeds our estimator within a standard, "brute-force" 
exploration of K, whilst the latter uses Greedy Backward Pruning. 

In all numerical experiments we stop the algorithms according to the rule described in 
Algorithm 1 with e = 10 -3 and vr m ; n = 5/n (for Unpen we use 7r m i n = p/n to ensure 
non-singular covariance estimates). For each method we use each of BIC and MMDL 
as model selection criteria. For Hmmgl, Unpen and Diagcov we compute estimates for 
K = 1, . . . ,i^true + 2 and pick the number of states minimi zing BIC or MMDL . As a 
reference, we also cluster the data using the R-package mc lust (jFralev and Raftervl . 120061 ) . 
We use the function Mclust; this employs Gaussian mixture models and uses BIC to 
automatically select between different covariance structures and numbers of clusters (we 
allow K = 1, . . . , .fCtruc + 2). We initialize Mclust using model based hierarchical clustering 
with equal spherical covariances (we note that the default initialization of Mclust, using 
hierarchical clustering with unconstrained cova r iances , performs worse in the examples 
below). For more details see iFralev and Raftervl (j2002l ). Specifications of all the methods 
are summarized in Table [TJ 



Method 


Selection Criterion C 


Bwprun 


BIC/MMDL 


Hmmgl 


BIC/MMDL 


Unpen 


BIC/MMDL 


Diagcov 


BIC/MMDL 


Mclust 


BIC 



Regularization /Constraints 



Initialization 



(Penp arcor , A un ;) 
(Penp arcor , A un ;) 
No constraints 
Diagonal covariances 
Various covariance structures 



KM (100 r.s.) at K max = 
KM (100 r.s.) 
KM (100 r.s.) 
KM (100 r.s.) 
Hierarchical clustering 
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(see lFralev and Raftervl (|2002j )) 



Table 1: Methods used in simulation Experiment I. [r.s. stands for random starts] 



We generated 50 datasets from each of Models 1-4 with a = 2 and report for all methods 
number of selected states and adjusted Rand index (this quantifies the extent to which 
estimated state assignments agree with true state membership). The results for Models 3 
and 4 are summarized in Figures Q] and [2j Figures [7] and [8] at the end of this manuscript 
show results for Models 1 and 2. 

In nearly all settings Diagcov is unable to recover the correct number of states and performs 
poorly in terms of adjusted Rand index. This is not surprising as Diagcov imposes incorrect 
model assumptions. Only in Model 3 with K tr ne = 2, where for both states the data 
generating covariance matrices are diagonal, does Diagcov perform well. MLE without 
penalization (Unpen) does well only in the low-dimensional Model 1. Both the proposed 
methods (Hmmgl and Bwprun) greatly outperform the other methods in Models 2-4. This 
supports the notion that regularization can be useful even when sample size n is seemingly 
large. 

HMMGLasso also works well in Model 1 with large n and very small p, a scenario where 
no constraints are necessary. This demonstrates that the adaptive strategy and universal 
regularization can be applied without any hand tuning also in the low-dimensional setting. 
We also read-off from Figures [Q[2] (see especially scenarios with K = 6) the substantial 
improvement of Greedy Backward Pruning relative to HMMGLasso, despite the fact that 
the latter carries out essentially a brute-force search over K. Also the use of MMDL 
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further improves performance (it never performs worse than BIC). Especially in tough 
and very high-dimensional scenarios (Model 3 and Model 4 with K=6) MMDL seems to 
perform better. 



Model 3: K=2 



unpen. b unpen. m hmmgl.b hmmgl.ir 

Model 3: K=4 



mclust diag.b 



unpen. b unpen. m hmmgl.b hmmgl.rr 

Model 3: K=6 



I 



ILj 



I 



mclust diag.b diag.m unpen. b unpen. m hmmgl.b hmmgl.rr 



Model 3: K=2 



mclust diag.b diag.m unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.ir 

Model 3: K=4 



unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.ir 

Model 3: K=6 



unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.ir 



Figure 1: Simulation Model 3 (p = 100, n = 1000), number of states and state assignments. 
Left panels: frequency of estimated number of states; in each case the correct number of 
states (i.e. number of states in data-generating model) is indicated in black. Right panels: 
adjusted Rand index with respect to true state assignments. [Legend: Results for Mclust 
(mclust), MLE with diagonal covariance matrices (diag), MLE (unpen) and Greedy 
Backward Pruning (bw) are shown. The extensions ".b" and ".m" stand for BIC and 
MMDL respectively] 
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Model 



4: K=2 



Model 4: K=2 




Figure 2: Simulation Model 4 (p = 50, n = 5000), number of states and state assignments. 
Left panels: frequency of estimated number of states (as in Fig Q] correct number of states 
indicated in black). Right panels: adjusted Rand index with respect to true state assign- 
ments. [Legend: Results for Mclust (mclust), MLE with diagonal covariance matrices 
(diag), MLE (unpen) and Greedy Backward Pruning (bw) are shown. Extensions ".b" 
and ".m" stand for BIC and MMDL respectively] 
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Experiment II: Graph Structure. In this experiment we focus on recovering state- 
specific graphical model structure. We consider Model 3 with K true € {2, 4, 6} and a € 
{2, 6, 10}. We compare Greedy Backward Pruning; HMMGLasso (K = K opt , Pen parcor , A un i); 
Kmeans (with number of clusters set to K = Kt rue ) followed by estimating cluster-specific 
inverse covariance matrices using Graphical Lasso; and Graphical Lasso using all samples 
(no state assignment or clustering). In Figure [3] True Positive Rate (TPR; with respect 
to edges in the data-generating graph) is plotted against the corresponding False Positive 
Rate (FPR) for all combinations of K and a and different methods. We note that Greedy 
Backward Pruning consistently selects the correct number of states in all scenarios except 
in (K tTVLC ,a) = (6,2) where it chooses K correctly in 36 out of 50 datasets. 

Greedy Backward Pruning performs well in terms of TPR and FPR. It is noteworthy 
that universal regularization using A un i gives consistently good results under a range of 
conditions. We see that HMMGLasso exhibits a smaller true positive rate in the most 
challenging -fCtruc = 6 case. For a = 2 Kmeans in combination with GLasso performs 
much worse in particular in terms of TPR. For larger a's (and therefore with increased 
information about state-assignment in the means) TPR and FPR of Kmeans improves. 
Finally, GLasso applied to all data without any clustering leads to very poor performance 
(this is likely a consequence of Simpson's paradox). 



3.2 Application to genomic data 



We conside r genome-wide bin d ing data for 5 3 pro teins in the Drosophila cell line Kcl67 
(data from IFilion et all l2010h . IFilion et all (j2Q10h represents an important step forward 
in the genome biology of Drosophila, showing for the first time how multivariate data can 
reveal protein-DNA binding patterns that depend on genome region. Here, we use this 
dataset to test our HMM methodology. The dataset offers a number of advantages for 
our purposes. First, the coverage of a relatively large number of proteins (p = 53) in 
the full data gives a high-dimensional example from current genome biology. Second the 
abundance of data (n = 33, 632 for chromosome 2L and n = 32, 791 for chromsome 2R) 
allows fully held-out validation on a large test set (we use the latter half of chromsome 2R, 
giving n tcs t = 16, 396) as well as exploration of the effect of (training) sample size. Finally, 
although substantive biological questions are beyond the scope of this paper, several open 
questions concerning genome organization in Drosophila, including the likely number of 
genome regions, and the possibility of region-specific protein-protein interplay, help to 
motivate the methodological questions we address here. 

Filion et all (j2Q10h identified regions of the genome by fitting a HMM (using classical, 
unpenalized estimation) to reduced-dimension data. Dimensionality reduction was carried 
out using principal component analysis (PCA) as a pre-processing step, with the HMM 
fitted to the first three principal components. Such appr oaches are cu r rently widely used 
in genome biology. By looking at principal components, IFilion et al. torn ) suggested a 
model with five states (corresponding to different chromatin types). They further noted 
that these five states are marked by enriched binding of the proteins HP1, PC, HI, BRM 
and MRG15 and that a 5-state HMM using only the five marker proteins as an input 
recapitulates 85.5% of the original state classification. 
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We investigated performance in a held-out predictive sense by training on the first n tra m = 
500, 1000, 2000, . . . , 5000 observations of chromosome 2L and then reporting the test log- 
likelihood obtained from the second half of chromosome 2R (n tes t = 16,396). As above 
we compare HMMGLasso (Hmmgl); Greedy Backward Pruning (Bwprun); unpenalized 
MLE (Unpen); and MLE with diagonal covariance matrices (Diagcov). Additionally 



we include a five-state MLE using only the five marker proteins reported by iFilion et al 



alpha=2 



alpha=6 



alpha=10 



r 






cc 

Q_ 
\— 


• 




CL 
Q_ 
I- 










o bwprun 
a hmmgl 
km+gl 

x gi 






















O 






o 









FPR 1 







FPR 1 







FPR 1 




Figure 3: Simulation experiment II, graphical model estimation. Comparing estimated 
state-specific conditional independence graphs against the data-generating graphs gave 
true positive and false positive rates with respect to edges in the graphs (TPR and FPR 
respectively). We show TPR plotted against FPR with K £ {2,4,6}, a £ {2,6,10} 
for Model 3. [Legend: Results for Greedy Backward Pruning (bwprun), HMMGLasso 
(hmmgl), Kmeans clustering with cluster- wise Graphical Lasso (km+glasso) and Graph- 
ical Lasso applied to non-clustered data (glasso) are shown] 
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( 2010l ) (Marker). For Hmmgl, Unpen and Diagcov the number of states is determined by 
exploring different -ftT's in a forward stepwise manner. We use MMDL and BIC as model 
selection criteria. All methods are initialized by Kmeans with initial centroids obtained 
using hierarchical clustering; this renders the overall analysis deterministic by removing 
variability due to random initialization of Kmeans. 

Figure H] shows the MMDL(BIC)-scores (scaled by n tra ; n ) and the negative test log- 
likelihood as a function of ntrain- Figure [5] depicts the selected number of states for 
each method and training sample size. Overall, we notice that MMDL (BIC) and test 
log-likelihood show similar patterns for different methods and different sample sizes. Bw- 
prun and Hmmgl greatly outperform Marker and Diagcov. This provides a topical example 
where a multivariate view (using all variables and modelling also state-specific covariances) 
improves out-of-sample predictive performance. The predictive gain of penalization com- 
pared to unpenalized MLE for moderate n/p-ratios is also noteworthy. As expected, the 
performance of Unpen in terms of MMDL (BIC) and test log-likelihood approaches the 
penalized methods with increasing sample size. However, in terms of number of states 
(Figure [5]) the estimates are very different even for large ntrainj i-e., penalization typically 
leads to more states than unpenalized MLE. This illustrates that the prediction-optimal 
number of states depends on the estimation procedure employed: regularization allows 
estimation for a greater number of states. If state-specific estimates have scientific rele- 
vance, this property can be important, since due to Simpson's paradox, estimates for finer 
state distinctions (larger K) cannot, in general, be recovered from coarser models (smaller 
K). We return to the question of exploration of number of states in Discussion below. 

We note that for each training sample size nt ra m the results shown in Figures HE] reflect 
performance for a single training sample of the specified length. For completeness, Figure[9] 
in the Appendix shows performance over 9 different training datasets of size ntrain = 1000. 

4 Discussion 

We considered penalized estimation in multivariate HMMs, including in particular the case 
of high dimensions and state-specific graphical models. As we demonstrated in simulated 
and real data examples, the methodology we propose substantially improves upon current 
practice. Our results demonstrate the utility of regularization for HMMs, even when 
sample sizes are not small. 

It is interesting to consider why careful penalization is needed in HMMs (and related latent 
variable settings like mixture models). In a simple linear model, as in regression, the ratio 
n/p is a measure to distinguish between a low- and high-dimensional problem. If the ratio 
n/p small, classical least-squares estimation leads to poor predictive performance due to 
a large number of predictors compared to a small sample size. On the other hand if n/p 
is large (for example > 20), then, very likely, least-squares regression performs well. In 
HMMs (and mixtures) the situation is more subtle. It is instructive to consider the ratios 
nk/p (nk denotes the number of samples belonging to state k) as a measure whether an 
inference problem is high-dimensional or not. If for at least one state this ratio is small, 
then MLE is likely to overfit and results in a poor generalization error. A fundamental 
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Figure 4: Genomic data, MMDL(BIC) and p redictive performance. Models were fitted to 
protein binding data from Filion et al. ( 201Cj| ) (see text for details) and tested on held-out 
data from the same study. Left panel: MMDL(BIC)-scores (scaled by retrain) f° r different 
methods trained on the first n tra m = 500, 1000, . . . , 5000 observations of chromosome 2L. 
Right panel: negative test log-likelihood evaluated on a test set (second half of chromosome 
2R; training data is from parts of chromosome 2L). [Legend: Greedy Backward Pruning 
(Bwprun); HMMGLasso (Hmmgl); Unpenalized MLE (Unpen); MLE with diagonal 
restricted covariance matrices (Diagcov); Five-state MLE using only marker proteins 
(Marker)] 



problem that we emphasized throughout the paper is the fact that the ratios n^/p depend 
on the number of states K and on the state-sizes n&, which are themselves usually unknown 
a priori. So, a seemingly low-dimensional problem with a large sample size and with a 
moderate number of features can become a high-dimensional task in practice, especially 
if a large number of states cannot be ruled out a priori. In fact, our simulations illustrate 
that even when min^ n^/p is relatively large, the MLE can be ill-behaved. For example, 
in our simulated Model 2, with K = 2, we have n = 2000 and n^jp > 13 in each state; 
nevertheless the MLE fails completely to recover correct state assignments (Fig El • 

A straightforward approach to handle inference in high-dimensional HMMs is to fix con- 
straints on the state-specific covariance matrices (for example assuming diagonal covari- 
ance matrices). However, such an approach leads to poor predictive performance when 
the assumption is invalid and precludes discovery of state-specific covariance structure. 
As in the genome biology example we considered, such structure may itself be of scientific 
interest. We note also that the hidden nature of the states makes it difficult to test any 
such model assumption. In fact, if the covariance matrices of an HMM with a specific 
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number of states satisfy some constraints, than these constraints do not necessarily hold 
for an HMM with smaller or larger number of states (Simpson's paradox). 

In the context of mixtures, there is a growing literature on pe nalized likelihood meth- 



ods which address the high-dimens i onal c ontext to some extent ([Khalili and Chenl . 120071 ; 



Stadler et ail I2OI0I : IPan and ShenL l2007h . However, none of these methods address the 



need to ensure penalties are able to handle state-specific scaling (that cannot be dealt 
with by pre-processing) and size (that is unknown at the outset). The selection of the 
number of mixture components also remains an open issue in this literature. Our ap- 
proach handles these issues that arise due to the hidden nature of the states and could be 
straightforwardly applied in the mixture model setting. Further generalization to other 
latent variable models may also be possible. 

The backward pruning approach gives an efficient way to estimate parameters for a se- 
quence of candidate number of states K. If desired, a single "optimal" number of states 
can then be selected using model selection criteria, as we demonstrated in examples above. 
For a given estimator, the optimal number of states is well defined in a predictive sense 
as the value that minimizes risk. From this point of view it is easy to understand why 
the prediction-optimal number of states may be higher under regularization or when more 
training data are available (see Figured]). However, when scientific understanding rather 
than prediction alone is one of the goals of analysis, it is not clear whether it is useful to 
think in terms of a "correct" number of states. Rather, it may be useful to think of the 
estimates {Ok} that we obtain via backward pruning as collectively providing a resource 
for exploration of a system of interest. 

In the genome biology example we considered, penalization led to gains in predictive 
ability relative to the MLE and to reduced dimension approaches that have been used in 
the literature. This suggests that despite redundancy in biological signals, a multivariate 
view can enhance predictive ability. Further, we were able to learn richer models than is 
possible using currently available methods, including estimates of state-specific graphical 
model structure. The latter may shed light on protein-protein interplay that is specific 
to genomic region; such interplay has not been investigated to date and is one foc us of 
our ongoing efforts in this application area. We used data from iFilion et al\ d2O10h : we 
note that the main substantive conclusions drawn in that paper are broadly supported 
by our analyses and the richer set of states uncovered by our approach are related to 
the states they report. Genomic datasets are becoming increasingly high-dimensional and 
we anticipate that the methodology presented here will be useful to researchers in that 
field. Beyond biology, applications for high-dimensional HMMs are numerous, including 
in signal processing. 

We showed that the approaches we put forward for HMMs, including universal regular- 
ization and Greedy Backward Pruning, work well in empirical examples. However, there 
remains a need for theoretical investigation of these ideas. Our penalty in combination 
with A un i was inspired by making connections to results obtained for the well-studied 
Lasso case. A challenge for future theoretical work is to provide insight into optimality 
of these and related approaches and establish global convergence properties of penalized 
estimation in latent variable settings. 
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A Graphical Lasso with different penalty functions 

In this section we briefly discuss optimization and performance of the Graphical Lasso 
problem 

fl = argmin - log \Q\ - tr(Sfi) + pPen(n) (A.7) 
n 

with the different penalty functions Pen(-) introduced in Section [2. li 



A.l Optimization 



. This case can be directly solved by the GLasso algorithm 
( 20081 ). This algorithm is implemented in the R-package 



Case 1: Penj n vmv (f2) = ll^~ 
presented in iFriedman et i 
glasso. Note that this implementation allows specification of different penalty levels for 
different entries in Q. 
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Case 2: Pen; nvcor (r2) = ||$~||i, where $ is the inverse correlation matrix. Note, that we 
can write = where V is a diagonal matrix with entries Vjj = Now, the 

objective function in (|A.7|) can be written as 

-21og|V| — log |*| +tr(VSVn)+p||*-||i. (A.8) 

or as 

— log |0] -tr(Sn) + £ -^=|O jy |. (A.9) 

Taking partial derivatives of (|A.8[) with respect to $>jj yields 

V), 1 \. S~ (j = l,..., P ). (A.10) 

By plugging-in (|A. 10|) into equation (|A.9p the desired optimization problem can be solved 
with the GLasso algorithm. 

Case 3: Pen parcor (f2) = where ^> is the partial correlation matrix. The objective 

function in (|A.7[) then equals 

-iog|n|-tr(sn) + i;-7?^s-l%l- (A.11) 

We solve (jA.llj) by setting f^°) = Diag(S) -1 and iteratively calling the GLasso algorithm 
according to: 

= argmin-log|0| - tr(Sfi) + V— ==J^=|%| (i = 0,1,2, .. .)• 

Note, that with the R-package glasso we can specify different penalty levels (here: 
pj \i Qj'jClj'^,, j,j' = 1, . . . ,p) for different entries in 0. 



A. 2 Performance 

We compare the penalty functions Peni nvcov , Pen parcor and Peni nvcor proposed in Section [2TT1 
under two different regimes: 

Model 5: Gaussian graph ical model with p = 5 and n = 100; Concentration matrix 
is generated as in Rothman et al. ( 20081 ) with p nonzero (off-diagonal) entries; 



Diagonal standardized to have entries equal one. 

Model 6: Gaussian graphical model with p = 50 and n = 100; Concentration matrix 
follows an AR(1) model with E W / = 0.9 1 '"'' 1 . Note that in the AR(1) model, the 
diagonal entries of £1 are not equal to one and therefore Q does not coincide with 
the partial correlation matrix. 

We generate training and test data for each model. For Model 5 and Model 6 we fit 
estimator IA.7I using different penalty functions and various tuning parameters (including 
A un j) on the training data and on scaled training data, where half of the variables are 
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scaled by 0.1 and the other half by 10 (in Model 5 the two halves are chosen randomly 
for each simulation run; in Model 6 one half are the variables 1, 3, 5, . . . , 49 and the other 
half the variables 2, 4, 6, ... , 50). We then report the log-likelihood obtained on test data. 

Boxplots in Figure [6] clearly demonstrate that penalization with Pen parcor and Peni nvcor is 
scale invariant, whereas penalizing the £i-norm of the inverse covariance matrix, which is 
the common practice in the Graphical Lasso, is not. The results in Figure [6] also agree 
with our findings in Section 12.21 Regularization with A UIU performs well with Peni nvcov 
only in Model 5, where £1 equals the partial correlation matrix. In both setting Peni nvcor 
with A un i does not perform well. However, penalizing the ^i-norm of the partial correlation 
matrix using A un j does a good job in both models, i.e., nearly as good as picking the best 
solution obtained with A op t- In Model 5, where coincides with the partial correlation 
matrix, Peni nvcov performs slightly better then Pen parcor as the former penalty is optimal 
from a distributional point of view. Similarly, in simulation Experiment I, where all S7's 
in the data generating models are standardized to have unit diagonal entries, we obtain 
slightly better performance if we use Pen; nvcov instead. 
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Figure 5: Genomic data, number of states. Number of states selected (at various training 
sample sizes) by Greedy Backward Pruning (Bwprun), HMMGLasso (Hmmgl), unpe- 
nalised MLE (Unpen) and MLE with diagonal restricted covariance matrices (Diagcov). 
All methods are trained on parts of chromosome 2L and use MMDL or BIC as the model 
selection criterion. The number of states in Hmmgl, Unpen and Diagcov are determined 
by a forward stepwise selection. 
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Figure 6: Test log-likelihood for different penalty functions for simulation Models 5 (left 
panel) and 6 (right panel). Red boxplots show results obtained from fitting on scaled 
training data and back-transforming parameter estimates on original scale. 



26 



Model 1 : K=2 



Model 1 : K=2 



i£ 30- 
S 25- 



CC 0.5 - 
S 0.4 - 



^ so- 
si 

"S 25- 



unpen.b unpen. m hmmgl.b hmmgl.ir 

Model 1 : K=4 



I 



E 



mclust diag.b 



unpen, b unpen. m hmmgl.b hmmgl.n 

Model 1 : K=6 



— 0.6 
(T 0.5 - 
S 0.4 - 
< 0.3 
0.2 



mclust diag.b diag.m unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.n 

Model 1 : K=4 




mclust diag.b dlag.rr 



unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.rr 

Model 1 : K=6 




diag.b diag.m unpen. b unpen. m hmmgl.b hmmgl.ir 



diag.b diag.m unpen. b unpen. m hmmgl.b hmmgl.m bw.b bw.n 



Figure 7: Simulation Experiment I, Model 1 (p = 10, n = 2000), number of states and 
state assignments. Left panels: frequency of estimated number of states (as in Fig [T] 
correct number of states indicated in black). Right panels: adjusted Rand index with 
respect to true state assignments. [Legend: Results for Mclust (mclust), MLE with 
diagonal covariance matrices (diag), MLE (unpen) and Greedy Backward Pruning (bw) 
are shown. Extensions ".b" and ".m" stand for BIC and MMDL respectively] 
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Model 2: K=2 
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Figure 8: Model 2 (p = 75, n = 2000), number of states and state assignments. Left pan- 
els: frequency of estimated number of states (correct number of states indicated in black). 
Right panels: adjusted Rand index with respect to true state assignments. [Legend: Re- 
sults for Mclust (mclust), MLE with diagonal covariance matrices (diag), MLE (unpen) 
and Greedy Backward Pruning (bw) are shown. Extensions ".b" and ".m" stand for BIC 
and MMDL respectively.] 
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Figure 9: Genomic data (Section 13. 2|) . multiple training datasets. Boxplots of 
MMDL(BIC)-scores, negative test log-likelihood (using test data as described in text) 
and number of selected states obtained over nine training datasets each of size retrain = 
1000. The ten datasets were obtained from chromosome 2L as Xt , ■ ■ ■ , -Xi _|_iooo with 
t = 0,500, 1000, 1500, ... ,4000. [Legend: Backward Pruning (bw), HMMGLasso (hm- 
mgl) and unpenalized MLE (unpen); extensions ".b" and ".m" stand for BIC respectively 
MMDL.] 
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